2.2 Data source and cleaning

2.2.1 Data Cleaning

Cleaning 

Before starting the Exploratory Data Analysis (EDA) phase, we confirmed that the dataset was clean and appropriately structured. The steps followed are listed below:

  • Filter to required subreddits - r/Dogecoin and r/Cryptocurrency

    • Get all posts from r/Dogecoin.

    • Get only posts from r/Cryptocurrency which contain ‘doge’ or ‘dogecoin’.

  • Remove posts with missing values or ‘[deleted]’.

  • Convert dates from Unix format to a YYYY-mm-dd-hh format (needed for time-specific analyses).
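The cleaning steps above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's actual code: the choice of `REQUIRED_FIELDS`, the case-insensitive keyword match, the search over both `title` and `selftext`, and the `created_hour` output name are all assumptions made for the example.

```python
import re
from datetime import datetime, timezone

# 'doge' also matches 'dogecoin'; case-insensitive matching is an assumption.
DOGE_PATTERN = re.compile(r"doge", re.IGNORECASE)

# Hypothetical choice of fields that must be present and not '[deleted]'.
REQUIRED_FIELDS = ("author", "title")


def in_scope(post):
    """Keep all r/Dogecoin posts, and r/Cryptocurrency posts that mention doge."""
    subreddit = (post.get("subreddit") or "").lower()
    if subreddit == "dogecoin":
        return True
    if subreddit == "cryptocurrency":
        text = f"{post.get('title') or ''} {post.get('selftext') or ''}"
        return bool(DOGE_PATTERN.search(text))
    return False


def is_complete(post):
    """Reject posts whose required fields are missing, empty, or '[deleted]'."""
    return all(post.get(f) not in (None, "", "[deleted]") for f in REQUIRED_FIELDS)


def to_hour_string(created_utc):
    """Convert a Unix timestamp (seconds) to a UTC 'YYYY-mm-dd-hh' string."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%d-%H")


def clean(posts):
    """Apply both filters, then attach the formatted timestamp to each kept post."""
    return [
        {**p, "created_hour": to_hour_string(p["created_utc"])}
        for p in posts
        if in_scope(p) and is_complete(p)
    ]
```

Converting in UTC keeps the hourly buckets independent of the machine's local time zone, which matters when the timestamps are later aligned with price data.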

Merging

We merged the submissions and comments datasets on the post ID, which is stored as id in the submissions dataset and as link_id in the comments dataset. The entire data cleaning process is documented in the ‘project_eda_cleaning.ipynb’ notebook. The characteristics of the merged dataset are listed below.
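A minimal sketch of this merge in plain Python is shown below. Several details are assumptions rather than facts from the notebook: that the join is an inner join, that link_id carries Reddit's `t3_` prefix, which must be stripped before matching, and that com_submis_id holds the stripped value. The `com_` prefix on comment columns follows the schema listed in this section.

```python
def strip_link_prefix(link_id):
    """Drop the 't3_' prefix Reddit puts on submission references (assumed)."""
    return link_id[3:] if link_id.startswith("t3_") else link_id


def merge_posts_comments(submissions, comments):
    """Inner-join comments to submissions on the post ID (join type assumed)."""
    posts_by_id = {s["id"]: s for s in submissions}
    merged = []
    for comment in comments:
        post_id = strip_link_prefix(comment["link_id"])
        post = posts_by_id.get(post_id)
        if post is None:
            continue  # comment refers to a post that was filtered out
        row = dict(post)
        # Prefix comment columns so they do not collide with post columns.
        row.update({f"com_{key}": value for key, value in comment.items()})
        row["com_submis_id"] = post_id  # assumed meaning of this column
        merged.append(row)
    return merged
```

Each output row pairs one comment with its parent post, so post-level fields repeat once per comment.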

Summary

The dataset has 587,972 rows and 19 columns. The majority of posts and comments are from r/Dogecoin (487,037), and the rest are from r/Cryptocurrency (100,935).

Variable list

The schema and the variable types are listed below.

subreddit: string (nullable = true)

subreddit_id: string (nullable = true)

id: string (nullable = true)

created_utc: long (nullable = true)

author: string (nullable = true)

is_self: boolean (nullable = true)

num_comments: long (nullable = true)

score: long (nullable = true)

selftext: string (nullable = true)

title: string (nullable = true)

com_subreddit: string (nullable = true)

com_subreddit_id: string (nullable = true)

com_id: string (nullable = true)

com_created_utc: long (nullable = true)

com_author: string (nullable = true)

com_link_id: string (nullable = true)

com_score: long (nullable = true)

com_body: string (nullable = true)

com_submis_id: string (nullable = true)

New variables

We created multiple new variables to use in the analysis, as described below.

  • Buy signals (buy_sig): If either the post or any of its comments contains any of these keywords: buy|bought|moon|hold|call|bull|like|yolo

  • Contains ‘doge|dogecoin’: Whether a post or comment mentions ‘doge’ or ‘dogecoin’.

  • Post activity per minute (hour): The average number of comments made on a post per minute (hour), computed by dividing the total number of comments by the time between the post’s creation timestamp and the timestamp of its last comment.

  • Day, month and hour: Derived by converting utc_time to the yyyy-mm-dd-hh format, as described above.

  • Percentage of posts from r/Dogecoin (pct_post_rdoge): The proportion of posts coming from each subreddit.
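The buy-signal flag and the activity rate described above can be sketched as follows. This is an illustration under stated assumptions: the keyword pattern is copied from the list above and applied case-insensitively without word boundaries (so ‘bullish’ or ‘likely’ would also match), and the function names and the handling of a zero-length interval are hypothetical.

```python
import re

# Keyword pattern from the variable description; case-insensitivity is assumed.
BUY_PATTERN = re.compile(r"buy|bought|moon|hold|call|bull|like|yolo", re.IGNORECASE)


def buy_sig(post_text, comment_texts):
    """True if the post or any of its comments contains a buy keyword."""
    texts = [post_text, *comment_texts]
    return any(BUY_PATTERN.search(text or "") for text in texts)


def comments_per_minute(num_comments, post_created_utc, last_comment_utc):
    """Average comments per minute between post creation and the last comment.

    Timestamps are Unix seconds. Returns None when the interval is not
    positive, since the rate is undefined in that case (assumed handling).
    """
    elapsed_minutes = (last_comment_utc - post_created_utc) / 60
    if elapsed_minutes <= 0:
        return None
    return num_comments / elapsed_minutes


def comments_per_hour(num_comments, post_created_utc, last_comment_utc):
    """Same rate expressed per hour."""
    per_minute = comments_per_minute(num_comments, post_created_utc, last_comment_utc)
    return None if per_minute is None else per_minute * 60
```

For example, a post with 120 comments whose last comment arrived one hour after creation has a rate of 2 comments per minute, or 120 per hour.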

2.2.2 Price Query

Dogecoin vs. Bitcoin Price

The graph shows the price fluctuations of Bitcoin and Dogecoin in 2022.
The analysis shows that both Bitcoin and Dogecoin declined in value in 2022, coinciding with the broader shift from a bullish to a bearish cryptocurrency market. Notably, Dogecoin was more volatile than Bitcoin. This heightened fluctuation can be attributed to Dogecoin’s valuation being driven largely by community sentiment rather than intrinsic economic factors. A particularly intriguing observation was Dogecoin’s price surge during the FTX crisis, suggesting potential responsiveness to specific market events.

Dogecoin vs. Bitcoin Growth Rate

To quantify the observed trends, we computed the growth rate based on periodic differences. This calculation reinforces the preliminary findings, highlighting Dogecoin’s pronounced susceptibility to fluctuations in response to market events, such as regulatory changes or major scandals. The comparative analysis underscores the distinct behavioral patterns of Bitcoin and Dogecoin within the same market conditions, offering valuable insights into the dynamics of cryptocurrency markets. This research contributes to the academic discourse by elucidating the factors driving volatility in digital currencies, with a particular focus on the influence of community engagement and external events on market behavior.
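As an illustration of a growth rate based on periodic differences, the sketch below computes the simple period-over-period relative change, (P_t - P_{t-1}) / P_{t-1}. The text does not state the exact formula or period length used, so both the formula and the function name are assumptions.

```python
def growth_rates(prices):
    """Relative change between each pair of consecutive prices.

    Returns a list one element shorter than the input. A zero previous
    price yields None, since the relative change is undefined there.
    """
    rates = []
    for previous, current in zip(prices, prices[1:]):
        if previous == 0:
            rates.append(None)
        else:
            rates.append((current - previous) / previous)
    return rates
```

Because the rate is relative to the previous price, it puts Bitcoin and Dogecoin on a common scale despite their very different price levels.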